Exploring Prosper Loans Data

What determines a good Prosper Score?

by Anthony Odiba

Investigation Overview

In this this investigation, I looked at the characteristics of prosper loan variables to see what determins if a loan will perform well or not.

I explored certain variables and how they interact to try to see if they can affect the loan status and to what extent. Overall my goal was to try to see how the the different variables in the dataset relate with Prosper Score and Credit Grade, because these two variables determins are meant to 'Predict' loan performance.

Dataset Overview

The data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, Prosper Score(A custom risk score which originated after July 2009, it was built using historical Prosper data. The score ranges from 1-10, with 10 being the best, or lowest risk score.), Credit Grade(The Credit rating used pre-2009 period, when the listing went live.) and many others.

The data set was last updated on 03/11/2014

In [1]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties
import matplotlib.cm, matplotlib.colors
import seaborn as sb
import geopandas as gpd, folium, requests,fiona, branca, json
import missingno as msno
from wordcloud import WordCloud,STOPWORDS
from PIL import Image
import urllib,requests,random,palettable
import plotly
import plotly.plotly as py
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly_express as px
%matplotlib inline
%config InlineBackend.figure_format ='retina'

# suppress warnings from final output
import warnings
warnings.simplefilter("ignore")
In [2]:
# Some Style Functions used
prosper=("#FE7702","#E10C79") 
prosper_hue2 =['#84807A',"#FE7702","#E10C79"]
prosper_hue3 =['#84807A',"#FE7702","#bfff00","#E10C79","#0000ff","#FFC300","#4A235A"]

def prosper_hue(word,font_size,position,orientation,random_state=None, **kwargs):
    return np.random.choice(prosper)  

#cell block containing various color elements used in notebook
def set_my_style():
    sb.set()
    hue_palette=sb.color_palette(prosper_hue2)
    sb.set_palette(hue_palette)

#used palettable colors because it's more vibrant than the one seaborn has to offer
prosper_hue3_plasma=palettable.matplotlib.Plasma_3.hex_colors
prosper_hue5_plasma=palettable.matplotlib.Plasma_5_r.hex_colors
prosper_hue_medium=palettable.matplotlib.Inferno_8.hex_colors
prosper_hue_medium_plasma=palettable.matplotlib.Plasma_8.hex_colors
prosper_hue_medium_plasma_r=palettable.matplotlib.Plasma_8_r.hex_colors
prosper_hue_11=palettable.matplotlib.Inferno_11.hex_colors
prosper_hue_11_plasma=palettable.matplotlib.Plasma_11.hex_colors
prosper_hue_many=palettable.matplotlib.Inferno_20.hex_colors #for plots with many distinct categories
prosper_hue_many_plasma=palettable.matplotlib.Plasma_11_r.hex_colors
In [3]:
# load in the dataset into a pandas dataframe
loan=pd.read_csv('prosperLoanData.csv')

Note that the above cells have been set as "Skip"-type slides. That means that when the notebook is rendered as http slides, those cells won't show up.

Map showing Number of Loans per State

From the map visualisation, we can see that most Prosper borrowers come from California.

I wanted to find out why and I did a little search and it turns out, the company was founded in California, USA in 2005, it's headquarters is also San Francisco.

I thought the situation of their offices was a possible reason but then I checked their other office location: Phoenix, Arizona and there, they had just 1901 Prosper Borrowers so i abandoned that theory.

Lastly I checked U.S.A's population distribution by state and the Prosper Borrower number mirrors that.

In [4]:
gdf = gpd.read_file("https://raw.githubusercontent.com/PublicaMundi/MappingAPI/master/data/geojson/us-states.json")
statesabbr=pd.read_json("https://gist.githubusercontent.com/mshafrir/2646763/raw/8b0dbb93521f5d6889502305335104218454c2bf/states_titlecase.json")
gdf1=pd.merge(left=gdf,right=statesabbr,how='left',on=['name'])
gdf1.rename(columns={'abbreviation':'BorrowerState'},inplace=True)
loan1=loan.copy()
gdf_test_df=gdf1.copy()
loan_test_df=loan1.groupby('BorrowerState',as_index=False)['ListingKey'].count()
gdf_test_df=pd.merge(left=gdf_test_df,right=loan_test_df,how='left',on=['BorrowerState'])
gdf_test_df.crs = fiona.crs.from_epsg(4326)
gdf_test_df=gdf_test_df[gdf_test_df['name']!='District of Columbia']
gdf_test_df.loc[51,'ListingKey']=0
In [5]:
variable = 'ListingKey'
gdf_test_df=gdf_test_df.sort_values(by=variable, ascending=True)
colormap = folium.LinearColormap(colors=["red","orange","yellow","green"],vmin=gdf_test_df.loc[gdf_test_df[variable]>0, variable].min(), 
                                 vmax=gdf_test_df.loc[gdf_test_df[variable]>0, variable].max()).to_step(n=5)
centroid=gdf_test_df.geometry.centroid
m=folium.Map(location=[centroid.y.mean(), centroid.x.mean()], zoom_start=4, tiles='StamenWatercolor')

#tooltip=folium.features.Tooltip()
#folium.

folium.GeoJson(gdf_test_df[['geometry','name',variable]],
               name="United States of America",
               style_function=lambda x: {"weight":2, 'color':'black','fillColor':colormap(x['properties'][variable]), 'fillOpacity':0.2},
              highlight_function=lambda x: {'weight':3, 'color':'black'},
               smooth_factor=2.0,
               #tooltip=tooltip
              tooltip=folium.features.GeoJsonTooltip(fields=['name',variable],
                                            aliases=['State','Number of loans by state'], 
                                              labels=True, 
                                              sticky=True,
                                             )
              ).add_to(m)


colormap.add_to(m)

folium.LayerControl(autoZIndex=False, collapsed=False).add_to(m)
m
Out[5]:

Lets see what listed profession are borrowing the most

From the word cloud and the occupation count plot, When we remove two highest Occupation classifications, we see that Computer Programmer is the Highest and it gradually reduces till the last which is Student-Technical School. From out dataset, the two most common Occupations are 'other' and 'professional';which is a broad classification and doesnt tell us much.

In [6]:
img_link2='https://raw.githubusercontent.com/tonyodiba/Prosper_Loans_data_exploration/master/files/326-3267244_prosper-loans-on-twitter-prosper-marke.png'
icon=Image.open(requests.get(img_link2,stream=True).raw).convert("RGBA")

mask = Image.new("RGBA", icon.size, (255,255,255))
mask.paste(icon,icon)
mask = np.array(mask)

wc=loan.Occupation.value_counts().to_dict()
wordcloud=WordCloud(width = 400, height = 400,
                    background_color='white',max_font_size=300,
                    random_state=123,stopwords=STOPWORDS,mask=mask).generate_from_frequencies(wc)
#wordcloud.recolor(color_func=color_func, random_state=7)
wordcloud.recolor(color_func=prosper_hue,random_state=53)
plt.figure(figsize=(10,10),facecolor = 'white', edgecolor='blue')
plt.imshow(wordcloud, interpolation='bilinear')
plt.tight_layout(pad=2)
plt.axis("off")
plt.show()

Which Income Bracket and Employment Status Group borrowed the Most ?

The Income Bracket borrowing the most from Prosper are the \$25,000-\$49,000, followed closely \$50,000-\$74,000 Income Bracket. When grouped by Employment Status, people listed as Employed borrowed more and "Retired" borrowed least.

In [7]:
loan.IncomeRange=loan.IncomeRange.astype('category')
plt.figure(figsize = [15, 4.5])
plt.subplots_adjust(wspace = 0.3)

plt.subplot(121)
set_my_style()
sb.set_context('notebook')
ax=loan.IncomeRange.value_counts().plot.bar(alpha=0.9,color=prosper_hue2[1])
plt.title('Income Rate Distribution')
plt.xlabel('Income Bracket')
plt.ylabel('Count')

plt.subplot(122)
ax=loan.EmploymentStatus.value_counts().plot.bar(alpha=0.9,color=prosper_hue2[0])
plt.title('Employment Status')
#plt.xlabel('Loan Status')
plt.ylabel('Count')
plt.xlabel('Employment Status')

plt.show()
In [8]:
# "ProsperRating (numeric)","ProsperRating (Alpha)","ProsperScore","CreditGrade" into ordered categorical types
ordinal_var_dict = {'ProsperScore': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0],
                    'ProsperRating (numeric)': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
                    'ProsperRating (Alpha)': ['HR', 'E', 'D', 'C', 'B', 'A', 'AA'],
                    'CreditGrade':['NC', 'HR', 'E', 'D', 'C', 'B', 'A', 'AA']
                   }

for var in ordinal_var_dict:
    ordered_var = pd.api.types.CategoricalDtype(ordered = True,
                                                categories = ordinal_var_dict[var])
    loan[var] = loan[var].astype(ordered_var)

#classified the LoanStatus into performing and NonPerforming.
loan['Loan_stat1']=(loan.LoanStatus.isin(['Completed','Current','FinalPaymentInProgress'])).astype('category')
#rename the newly created loan_stat category column
loan.Loan_stat1.cat.rename_categories({False:'NonPerforming',True:'Performing'},inplace=True)
#change LoanStatus to category type
loan.LoanStatus=loan.LoanStatus.astype('category')
#make new loanStatus columns
loan['Loan_stat2']=loan['LoanStatus']
loan['Loan_risk']=loan['LoanStatus']
#create dictionary used to rename column
clean_Loan_stat_risk={'Loan_stat2':{'FinalPaymentInProgress':'Completed','Cancelled': 'Completed',
                                    'Past Due (1-15 days)': 'PastDue','Past Due (16-30 days)':'PastDue',
                                    'Past Due (31-60 days)':'PastDue','Past Due (61-90 days)':'PastDue',
                                    'Past Due (91-120 days)':'PastDue','Past Due (>120 days)':'PastDue'},
                      
                      'Loan_risk':{'FinalPaymentInProgress':'Completed','Cancelled': 'Completed',
                                   'Past Due (1-15 days)': 'HighRisk','Past Due (16-30 days)':'HighRisk',
                                   'Past Due (31-60 days)':'HighRisk','Past Due (61-90 days)':'HighRisk',
                                   'Past Due (91-120 days)':'HighRisk','Past Due (>120 days)':'HighRisk',
                                   'Defaulted':'HighRisk','Chargedoff':'HighRisk'}}
#use replace to change categorys values
loan.replace(clean_Loan_stat_risk,inplace=True)
#change type
loan.Loan_stat2=loan.Loan_stat2.astype('category')
loan.Loan_risk=loan.Loan_risk.astype('category')

Prosper Score Vs Credit Grade

From the plot of both ProsperScore and CcreditGrade vs LoanStatus(i.e Performing and NonPerforming Loans), I observed that Prosper Score was wayy more efficient at determining the performance of a loan than creditGrade was. This indicates that the risk management system at Prosper has actually gotten better after the restructuring.

In [9]:
plt.figure(figsize=(15,4.4))
plt.subplots_adjust(wspace = 0.5)
plt.subplot(121)
ax=sb.countplot(x='ProsperScore',data=loan,hue='Loan_stat1',hue_order=['Performing','NonPerforming'],alpha=0.9,
                palette=['#FE7702', '#E10C79'])
plt.subplot(122)
ax=sb.countplot(x='CreditGrade',data=loan,hue='Loan_stat1',hue_order=['Performing','NonPerforming'],alpha=0.9,
                palette=['#FE7702', '#E10C79'])

DebtToIncomeRatio vs BorrowerAPR colored by CreditGrade

For BorrowerAPR vs DebttoIncomeRatio, firstly, most people tend to have debt-to-income ratios below 1, regardless of risk category. Iobserved alos that AA category loans seem to have a lower BorrowerAPRs and a smaller range of debt-to-income ratios, both of which indicate less risk and APR increases as the rating gets riskier. Most people tend to have debt-to-income ratios below 1, regardless of risk category.

In [10]:
sb.set_context('talk')
ax=sb.lmplot(data=loan,size=7,aspect=2,x='DebtToIncomeRatio',y='BorrowerAPR',hue='CreditGrade',
               fit_reg=False,palette=prosper_hue_medium_plasma_r)
ax.set_titles('DebtToIncomeRatio vs BorrowerAPR grouped by CreditGrade')
plt.show()

StatedMonthlyIncome vs LoanAmount differentiated by ProsperScore

This plot of the StatedMonthlyIncomw vs LoanOriginalAmount categorized by Risk profile and grouped by colored representing the different Prosper Scores does a good job of showing us how the StatedMonthlyIncomw and LoanOriginalAmount are related.

From the plot we can see that the ordinary leeast square regression line shows an upward trend, meaning that as statedMonthly income increased, the loan amoint increased. this trend is more apparent in the prosperScore 10 group.

It is an interactive plot please highlight the ProsperScore 10 to visualize it.

In [11]:
init_notebook_mode(connected=True)
loan_no_nan=loan.copy()
loan_no_nan2=loan.copy()
loan_no_nan.dropna(subset=['ProsperScore'],inplace=True)
loan_no_nan2.dropna(subset=['CreditGrade'],inplace=True)
In [12]:
fig=px.scatter(loan_no_nan,width=1000,height=700,title='StatedMonthlyIncomw vs LoanOriginalAmount categorized by Riskprofile and ProsperScore',trendline='ols',trendline_color_override='Black',x='LoanOriginalAmount',
           y='StatedMonthlyIncome',color='ProsperScore',log_x=True,facet_col='Loan_risk',
           log_y=True)
plotly.io.show(fig)

BorrowerAPR Vs ProsperScore

This plot shows Borrower Rate vs Prosper Score grouped by Performing vs Non-Performing loans. I classified the LoanStatus into Performing and NonPerforming Categories.

The performing Category consists of;

  • The 'Completed' class
  • The 'Current' class (since the haven't defaulted)
  • The 'FinalPaymentInProgress' class

BorrowerAPR decreases with improving Prosper Score. For performing loans, it was observed that prosperScore of >10 enjoy the least BorrowerAPR, with a relatively short range of about 0.05-0.14, and it seems Home owners enjoy an even smaller BorrowerAPR compared to non-home owners.

For the non-performing loan group, from a prosperScore of 0-5, the borrower Annual Percentage Rate is about the same with variation in range, however from the rating score of >6, it decreases with improving score. Also here, the BorrowerAPR for home owners appears to be lower than for non-home owners.

In [13]:
loan_Borrower=loan.copy()
loan_Borrower2=loan.copy()

loan_Borrower.dropna(subset=['CreditGrade'],inplace=True)
loan_Borrower2.dropna(subset=['ProsperScore'],inplace=True)
In [14]:
fig=px.box(loan_no_nan,width=1000,height=450,x='ProsperScore',color='IsBorrowerHomeowner',
           y='BorrowerAPR',log_x=False,facet_col='Loan_stat1',
           log_y=False)
plotly.io.show(fig)

Parallel plot showing interaction of 9 different variables

This is a fun plot that shows the interaction of 9 different variables;

  • 'Original Loan Amount'
  • 'Loan Status'
  • 'DebtToIncomeRatio'
  • 'Monthly Income(Stated)'
  • 'Recommendations'
  • 'Investors'
  • 'TotalProsperLoans'
  • 'Prosper Score'

From the 'Parallel plot showing interaction of 9 different variables' I observed that they were a lot of loans which had a prosperScore of 9,10 and 11 that were either chargedoff or defaulted,some of these loans with original loan amount between \$20,000 and \$25,000 had monthlyIncome Amounts close to the borrowed amounts.

Feel free to fiddle with the plot to see what insight you may uncover!

In [15]:
loan_parallel=loan.copy()
loan_parallel['Par_col']=loan_parallel['Loan_risk']
loan_parallel['Par_col2']=loan_parallel['Loan_stat2']

clean_Loan_risk={'Par_col':{'Completed':1,'Current': 2,'HighRisk':3},
                 'Par_col2':{'Completed':1,'Current':2, 'PastDue':3, 'Defaulted':4, 'Chargedoff':5}}
loan_parallel.replace(clean_Loan_risk,inplace=True)
In [16]:
init_notebook_mode(connected=True)

df=loan_parallel
data = [
    go.Parcoords(
        line = dict(color = df['Par_col'],
                   colorscale = [[0,'#84807A'],[0.5,"#FE7702"],[1,'#E10C79']]),
        #line = dict(color = df['colorVal'],
        #           colorscale = 'Jet',
        #           showscale = True,
        #           reversescale = True,
        #           cmin = -4000,
        #           cmax = -100),
        dimensions = list([
            dict(range = [100,35000],
                 constraintrange=[20000,25000],
                 label = 'Original Loan Amount', values = df['LoanOriginalAmount']),
            dict(tickvals = [1,2,3,4,5],
                 constraintrange=[4,5],
                 ticktext = ['Completed', 'Current', 'PastDue', 'Defaulted', 'Chargedoff'],
                 label = 'Loan Status', values = df['Par_col2']),
            dict(range=[0.0,10.01],
                 label = 'DebtToIncomeRatio', values = df['DebtToIncomeRatio']),
            dict(range = [0.0,175500],#monthly income range too high 0.0-1755000 reduce it
                 visible = True,
                 label = 'Monthly Income(Stated)', values = df['StatedMonthlyIncome']),
            dict(range = [0,40],
                 label = 'Recommendations', values = df['Recommendations']),
            dict(range = [1,1189],
                 label = 'Investors', values = df['Investors']),
            dict(range = [0,8],
                 label = 'TotalProsperLoans', values = df['TotalProsperLoans']),
            dict(range = [-1,12],
                 constraintrange=[9,11],
                 tickvals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0],
                 label = 'Prosper Score', values = df['ProsperScore'])
            #[0.0,646285.0] also add incomerange
        ])
    )
]

layout = go.Layout(
    title='Parallel plot showing interaction of 9 different variables',
    plot_bgcolor = '#E5E5E5',
    paper_bgcolor = '#E5E5E5',
    autosize=False,
    width=1000,
    height=520
)


fig = go.Figure(data = data, layout = layout)
#plotly.offline.iplot(fig,filename='Prosper_loans_parallel')
plotly.io.show(fig)
#plotly.offline.plot(fig, filename = 'Proser_loans.html')

Once you're ready to finish your presentation, check your output by using nbconvert to export the notebook and set up a server for the slides. From the terminal or command line, use the following expression:

jupyter nbconvert <file_name>.ipynb --to slides --post serve --template output_toggle

This should open a tab in your web browser where you can scroll through your presentation. Sub-slides can be accessed by pressing 'down' when viewing its parent slide. Make sure you remove all of the quote-formatted guide notes like this one before you finish your presentation!

In [35]:
!jupyter nbconvert Prosper_loan_exploration_slide_deck.ipynb --to slides --post serve --template output_toggle
[NbConvertApp] Converting notebook Prosper_loan_exploration_slide_deck.ipynb to slides
[NbConvertApp] Writing 12086042 bytes to Prosper_loan_exploration_slide_deck.slides.html
[NbConvertApp] Redirecting reveal.js requests to https://cdnjs.cloudflare.com/ajax/libs/reveal.js/3.5.0
Serving your slides at http://127.0.0.1:8000/Prosper_loan_exploration_slide_deck.slides.html
Use Control-C to stop this server
WARNING:tornado.access:404 GET /custom.css (127.0.0.1) 1.19ms
WARNING:tornado.access:404 GET /custom.css (127.0.0.1) 8.43ms
WARNING:tornado.access:404 GET /custom.css (127.0.0.1) 0.61ms
^C

Interrupted